Incremental Entity Blocking over Heterogeneous Streaming Data

Abstract

Web systems have become a valuable source of semi-structured and streaming data. In this sense, Entity Resolution (ER) has become a key solution for integrating multiple data sources or identifying similarities between data items, namely entities. To avoid the quadratic costs of the ER task and improve efficiency, blocking techniques are usually applied. Beyond the traditional challenges faced by ER and, consequently, by the blocking techniques, there are also challenges related to streaming data, incremental processing, and noisy data. To address them, we propose a schema-agnostic blocking technique capable of handling noisy and streaming data incrementally through a distributed computational infrastructure. To the best of our knowledge, there is a lack of blocking techniques that address these challenges simultaneously. This work proposes two strategies (attribute selection and top-n neighborhood entities) to minimize resource consumption and improve blocking efficiency. Moreover, this work presents a noise-tolerant algorithm, which minimizes the impact of noisy data (e.g., typos and misspellings) on blocking effectiveness. In our experimental evaluation, we use real-world pairs of data sources, including a case study that involves data sources from Twitter and Google News. The proposed technique achieves better results regarding effectiveness and efficiency compared to the state-of-the-art technique (metablocking). More precisely, the application of the two strategies over the proposed technique alone improves efficiency by 56%, on average.

Related resources

Parallel meta-blocking for scaling entity resolution over big heterogeneous data

Entity resolution constitutes a crucial task for many applications, but has an inherently quadratic complexity. In order to enable entity resolution to scale to large volumes of data, blocking is typically employed: it clusters similar entities into (overlapping) blocks so that it suffices to perform comparisons only within each block. To further increase efficiency, Meta-blocking is being used...

Streaming Queries over Streaming Data

Recent work on querying data streams has focused on systems where newly arriving data is processed and continuously streamed to the user in real-time. In many emerging applications, however, ad hoc queries and/or intermittent connectivity also require the processing of data that arrives prior to query submission or during a period of disconnection. For such applications, we have developed PSoup...

Named Entity Disambiguation in Streaming Data

The named entity disambiguation task is to resolve the many-to-many correspondence between ambiguous names and unique real-world entities. This task can be modeled as a classification problem, provided that positive and negative examples are available for learning binary classifiers. High-quality sense-annotated data, however, are hard to obtain in streaming environments, since the trainin...

Scaling Entity Resolution to Large, Heterogeneous Data with Enhanced Meta-blocking

Entity Resolution constitutes a quadratic task that typically scales to large entity collections through blocking. The resulting blocks can be restructured by Meta-blocking in order to significantly increase precision at a limited cost in recall. Yet, its processing can be time-consuming, while its precision remains poor for configurations with high recall. In this work, we propose new meta-blo...

Optimizing Sampling-based Entity Resolution over Streaming Documents

Increasingly, organizations have employed methods to understand unstructured text across the web. Entity resolution is used to identify mentions in large, streaming text corpora. Sampling-based entity resolution using Markov Chain Monte Carlo (MCMC) techniques guarantees convergence to a stationary distribution and can jump out of a local optimum. When performing entity resolution over streams ...

Journal

Journal title: Information

Year: 2022

ISSN: 2078-2489

DOI: https://doi.org/10.3390/info13120568